Purpose

This document lists the steps we followed to refine the models using climate data only.

Data

The data used in the models below are described in the Data Wrangle folder.

Response variables

Note that the analysis below uses the iNat data with 1510 observations. Amazing!

  • Response variables to consider
    • Number of unhealthy trees
    • Tree canopy symptoms

Below we compare models using canopy symptoms as the response variable.

Explanatory variables

  • Explanatory variables included
    • iNat data
      • explanatory variables such as tree size
    • Climate data
      • 30-year normals, 1991–2020 (265 variables)
      • last decade, 2011–2020 (265 variables)

Data wrangling

There are multiple ways to group the response variables, depending on the desired resolution of the model.

For now, we can move forward with the binary response grouping because it is the broadest grouping and the easiest for the model to classify.

All tree health categories

## # A tibble: 12 × 2
## # Groups:   field.tree.canopy.symptoms [12]
##    field.tree.canopy.symptoms                             n
##    <fct>                                              <int>
##  1 Branch Dieback or 'Flagging'                          74
##  2 Browning Canopy                                       43
##  3 Candelabra top or very old spike top (old growth)      2
##  4 Extra Cone Crop                                        4
##  5 Healthy                                              813
##  6 Multiple Symptoms (please list in Notes)              31
##  7 New Dead Top (red or brown needles still attached)    48
##  8 Old Dead Top (needles already gone)                  154
##  9 Other (please describe in Notes)                      16
## 10 Thinning Canopy                                      225
## 11 Tree is dead                                          74
## 12 Yellowing Canopy                                      26

We also need to filter the data to include only the response and explanatory variables we’re interested in. For example, whether a sound clip was included in the iNat data is not important.

We also need to remove other response variables, such as “field.percent.canopy.affected….”, so they are not used as predictors for tree health.
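A sketch of this filtering step using dplyr select helpers (the data frame names are placeholders, and all column patterns other than the `field.percent.canopy.affected` prefix are assumptions about the iNat export):

```r
# Keep only the variables of interest; names besides the
# field.percent.canopy.affected prefix are hypothetical
library(dplyr)

inat.filtered <- inat.raw %>%
  select(
    -contains("sound"),                            # e.g., whether a sound clip was attached
    -starts_with("field.percent.canopy.affected")  # competing response variable
  )
```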

Note that it might be interesting to know whether the user was an important factor in predicting if the tree is healthy/unhealthy.

There are also a number of factors that should probably be removed because they may bias the data. For example, the ‘other factor’ question may only be answered for unhealthy trees. We need to think about this a bit more.

Remove variables that have near-zero standard deviations (i.e., the entire column is the same value)
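One way to do this is with `caret::nearZeroVar()`, which flags constant and near-constant columns (`inat.filtered` is a placeholder name for the wrangled data frame):

```r
# Drop near-zero-variance columns; inat.filtered is a placeholder
library(caret)

nzv <- nearZeroVar(inat.filtered)  # indices of near-zero-variance columns
if (length(nzv) > 0) {
  inat.filtered <- inat.filtered[, -nzv]
}
```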

Group response variables to binary values
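A minimal sketch of the binary grouping, collapsing the 12 canopy-symptom levels into Healthy vs. Unhealthy (the data frame name is a placeholder, and whether every non-Healthy level truly belongs in Unhealthy is an assumption):

```r
# Collapse all non-Healthy symptom levels into a single Unhealthy level;
# inat.filtered is a placeholder for the wrangled data frame
library(dplyr)
library(forcats)

binary <- inat.filtered %>%
  mutate(field.tree.canopy.symptoms = fct_other(
    field.tree.canopy.symptoms,
    keep = "Healthy",
    other_level = "Unhealthy"
  ))
```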

Binary tree health categories

## # A tibble: 2 × 2
## # Groups:   field.tree.canopy.symptoms [2]
##   field.tree.canopy.symptoms     n
##   <fct>                      <int>
## 1 Healthy                      813
## 2 Unhealthy                    695

Approach

Models

Comparing full and reduced datasets of explanatory variables

  • Datasets compared below
    • Full model = all climate variables retained
    • Monthless model = individual-month climate parameters removed
    • Normal-only, monthless model = decadal (2011–2020) data also removed
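One way to construct the reduced datasets, assuming the monthly climate columns carry a numeric month suffix (e.g., `_01` through `_12`) and the decadal columns contain `2011_2020` in their names; both naming conventions are assumptions, not confirmed from the data dictionary:

```r
# Build the reduced explanatory-variable datasets from the binary data;
# the column-name patterns are assumptions about the climate variables
library(dplyr)

monthless.binary <- binary %>%
  select(-matches("_(0[1-9]|1[0-2])$"))  # drop individual-month variables

normal.monthless.binary <- monthless.binary %>%
  select(-contains("2011_2020"))         # keep only the 30-year normals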

Full Model

## 
## Call:
##  randomForest(formula = field.tree.canopy.symptoms ~ ., data = binary,      ntree = 2001, importance = TRUE, proximity = TRUE, na.action = na.omit) 
##                Type of random forest: classification
##                      Number of trees: 2001
## No. of variables tried at each split: 21
## 
##         OOB estimate of  error rate: 30.7%
## Confusion matrix:
##           Healthy Unhealthy class.error
## Healthy       586       227   0.2792128
## Unhealthy     236       459   0.3395683


Monthless Model

## 
## Call:
##  randomForest(formula = field.tree.canopy.symptoms ~ ., data = monthless.binary,      ntree = 2001, importance = TRUE, proximity = TRUE, na.action = na.omit) 
##                Type of random forest: classification
##                      Number of trees: 2001
## No. of variables tried at each split: 12
## 
##         OOB estimate of  error rate: 30.97%
## Confusion matrix:
##           Healthy Unhealthy class.error
## Healthy       582       231   0.2841328
## Unhealthy     236       459   0.3395683

Normal, Monthless Model

## 
## Call:
##  randomForest(formula = field.tree.canopy.symptoms ~ ., data = normal.monthless.binary,      ntree = 2001, importance = TRUE, proximity = TRUE, na.action = na.omit) 
##                Type of random forest: classification
##                      Number of trees: 2001
## No. of variables tried at each split: 9
## 
##         OOB estimate of  error rate: 30.57%
## Confusion matrix:
##           Healthy Unhealthy class.error
## Healthy       584       229   0.2816728
## Unhealthy     232       463   0.3338129
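Since all three models were fit with `importance = TRUE`, the top climate predictors can be inspected directly; a sketch for the normal-only, monthless model (the model object name `rf.normal.monthless` is a placeholder, as the fitted objects are not named in the output above):

```r
# Inspect variable importance for the fitted forest;
# rf.normal.monthless is a placeholder object name
library(randomForest)

varImpPlot(rf.normal.monthless, n.var = 15,
           main = "Normal, monthless model")

imp <- importance(rf.normal.monthless, type = 1)  # mean decrease in accuracy
head(imp[order(-imp[, 1]), , drop = FALSE], 10)   # top 10 predictors
```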